As graduate students, debt is all around us. Fortunately, we are also data scientists, and data is all around us, too. Most of the major milestones in life - such as graduating from college or a graduate program, or buying a house - require taking on debt. A report from Time’s Money Magazine based on the Federal Reserve’s Survey of Consumer Finances found that on average, Americans under 35 owe $67,400. For middle-aged Americans, the average is even higher, ranging from $108,300 for 55-64 year olds to $134,600 for 45-54 year olds (http://time.com/money/5233033/average-debt-every-age/).
We are interested in digging deeper into how loan applications are considered, and better understanding factors which might be considered by lending agents when applying for a loan, from the applicant’s perspective. Alternatively, our analysis could be useful from the lender’s perspective in identifying factors which predict loan defaults. The goal for our final project was to build a prediction model using the LendingClub* loans data set. Specifically, we wanted to build a model that could predict a dichotomized outcome of loan status (which will be explained in more detail below) using the variables given in the data set.
On this page, we will walk you through our preliminary data exploration. On the methods page, we will walk you through our process for building our final prediction model.
*Lending Club is a United States based peer-to-peer lending platform. It is the world’s largest of such companies, facilitating loan exchange between borrowers and investors. Lending Club enables borrows to create unsecured personal loans between $1,000 and $40,000. Investors can search a database of loan listings on their website and select any number of loans to invest in.
rm(list = ls())
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(tidyverse)
library(funModeling)
library(caret)
library(VIM)
library(mice)
library(ggcorrplot)
library(plotly)
library(pROC)
library(lubridate)
library(glmnet)
library(broom)
library(MASS)
library(usmap)
library(RColorBrewer)
set.seed(1)
set_dir <- '/Users/shuheimiyasaka/Google Drive/BST 260 Final Project/For Submission'
Our data is available via kaggle at https://www.kaggle.com/wendykan/lending-club-loan-data. This data is unique from most other financial institution’s data because of the lending method used by ‘LendingClub’. Headquartered in San Francisco, LendingClub connects borrowers applying for personal loans, auto refinancing, business loans, and elective medical procedures with investors. LendingClub reports that it is America’s largest online marketplace, and emphasizes the digital aspects of its model (https://www.lendingclub.com/). LendingClub also services the loans, and therefore maintains data on the loans’ statuses, as well as information about the loan application.
setwd(set_dir)
load('./loan.Rdata')
#loan.dat <- read.csv('loan.csv', header = TRUE)
#save(loan.dat, file = "./loan.RData")
The data set has 887,379 records with 74 variables.
dim(loan.dat)
## [1] 887379 74
names(loan.dat)
## [1] "id" "member_id"
## [3] "loan_amnt" "funded_amnt"
## [5] "funded_amnt_inv" "term"
## [7] "int_rate" "installment"
## [9] "grade" "sub_grade"
## [11] "emp_title" "emp_length"
## [13] "home_ownership" "annual_inc"
## [15] "verification_status" "issue_d"
## [17] "loan_status" "pymnt_plan"
## [19] "url" "desc"
## [21] "purpose" "title"
## [23] "zip_code" "addr_state"
## [25] "dti" "delinq_2yrs"
## [27] "earliest_cr_line" "inq_last_6mths"
## [29] "mths_since_last_delinq" "mths_since_last_record"
## [31] "open_acc" "pub_rec"
## [33] "revol_bal" "revol_util"
## [35] "total_acc" "initial_list_status"
## [37] "out_prncp" "out_prncp_inv"
## [39] "total_pymnt" "total_pymnt_inv"
## [41] "total_rec_prncp" "total_rec_int"
## [43] "total_rec_late_fee" "recoveries"
## [45] "collection_recovery_fee" "last_pymnt_d"
## [47] "last_pymnt_amnt" "next_pymnt_d"
## [49] "last_credit_pull_d" "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog" "policy_code"
## [53] "application_type" "annual_inc_joint"
## [55] "dti_joint" "verification_status_joint"
## [57] "acc_now_delinq" "tot_coll_amt"
## [59] "tot_cur_bal" "open_acc_6m"
## [61] "open_il_6m" "open_il_12m"
## [63] "open_il_24m" "mths_since_rcnt_il"
## [65] "total_bal_il" "il_util"
## [67] "open_rv_12m" "open_rv_24m"
## [69] "max_bal_bc" "all_util"
## [71] "total_rev_hi_lim" "inq_fi"
## [73] "total_cu_tl" "inq_last_12m"
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]
As part of data exploration, we examined with percentage of “zeros”, missing records, and unique values in the data set per variable as shown above. From the table above, we notice a number of variables with significant amount of missing data.
Based on examining the data set and reading the data dictionary, we decided to immediately rule out the following variables from our model: id, member_id, url, and desc.
cols.2.remove <- c('id', 'member_id', 'url', 'desc')
We decided to exclude variables with more than 10% missing data (19 variables).
missing.data.col <- meta_loans$variable[meta_loans$p_na > 10.]
missing.data.col
## [1] "mths_since_last_delinq" "mths_since_last_record"
## [3] "mths_since_last_major_derog" "annual_inc_joint"
## [5] "dti_joint" "open_acc_6m"
## [7] "open_il_6m" "open_il_12m"
## [9] "open_il_24m" "mths_since_rcnt_il"
## [11] "total_bal_il" "il_util"
## [13] "open_rv_12m" "open_rv_24m"
## [15] "max_bal_bc" "all_util"
## [17] "inq_fi" "total_cu_tl"
## [19] "inq_last_12m"
length(missing.data.col)
## [1] 19
cols.2.remove <- c(cols.2.remove, missing.data.col)
meta_loans[order(meta_loans$unique),]
cols.2.remove <- c(cols.2.remove, 'policy_code')
We also decided to remove policy_code since it only has one unique value.
At this point, we had 50 potential covariates:
cols.2.keep <- !(colnames(loan.dat) %in% cols.2.remove)
colnames(loan.dat)[cols.2.keep]
## [1] "loan_amnt" "funded_amnt"
## [3] "funded_amnt_inv" "term"
## [5] "int_rate" "installment"
## [7] "grade" "sub_grade"
## [9] "emp_title" "emp_length"
## [11] "home_ownership" "annual_inc"
## [13] "verification_status" "issue_d"
## [15] "loan_status" "pymnt_plan"
## [17] "purpose" "title"
## [19] "zip_code" "addr_state"
## [21] "dti" "delinq_2yrs"
## [23] "earliest_cr_line" "inq_last_6mths"
## [25] "open_acc" "pub_rec"
## [27] "revol_bal" "revol_util"
## [29] "total_acc" "initial_list_status"
## [31] "out_prncp" "out_prncp_inv"
## [33] "total_pymnt" "total_pymnt_inv"
## [35] "total_rec_prncp" "total_rec_int"
## [37] "total_rec_late_fee" "recoveries"
## [39] "collection_recovery_fee" "last_pymnt_d"
## [41] "last_pymnt_amnt" "next_pymnt_d"
## [43] "last_credit_pull_d" "collections_12_mths_ex_med"
## [45] "application_type" "verification_status_joint"
## [47] "acc_now_delinq" "tot_coll_amt"
## [49] "tot_cur_bal" "total_rev_hi_lim"
length(colnames(loan.dat)[cols.2.keep])
## [1] 50
loan.dat <- loan.dat[, cols.2.keep]
We also decided to remove 6 records with missing or zero annual income since we felt this information was a requirement for obtaining a loan and a covariate that we must definitely include in our final model (and didn’t feel we could impute these values properly)!
query = loan.dat$annual_inc == 0.
query.na = is.na(query)
if (sum(query.na) > 0){
query[query.na] = TRUE
}
if (sum(query) > 0){
loan.dat = loan.dat[!query,]
} else stop('unexpected case')
With the remaining set of records and covariates, we decided to examine the pairwise correlation of covariates:
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']
cor.dat <- cor(loan.dat[,numeric_cols], loan.dat[,numeric_cols])
plot_ly(x=colnames(cor.dat),
y=rownames(cor.dat),
z = cor.dat, type = "heatmap", colorscale="Greys")
#ggcorrplot(cor(loan.dat[,numeric_cols]))
#aggr(loan.dat, combined=T, cex.axis=0.6)
We notice from the plot above that there are a few covariates that are highly correlated (which is not unexpected).
We also calculated basic summary statistics of our covariates to help us better understand the data:
summary(loan.dat)
## loan_amnt funded_amnt funded_amnt_inv term
## Min. : 500 Min. : 500 Min. : 0 36 months:621119
## 1st Qu.: 8000 1st Qu.: 8000 1st Qu.: 8000 60 months:266254
## Median :13000 Median :13000 Median :13000
## Mean :14755 Mean :14742 Mean :14703
## 3rd Qu.:20000 3rd Qu.:20000 3rd Qu.:20000
## Max. :35000 Max. :35000 Max. :35000
##
## int_rate installment grade sub_grade
## Min. : 5.32 Min. : 15.67 A:148198 B3 : 56323
## 1st Qu.: 9.99 1st Qu.: 260.71 B:254535 B4 : 55626
## Median :12.99 Median : 382.55 C:245859 C1 : 53387
## Mean :13.25 Mean : 436.72 D:139541 C2 : 52235
## 3rd Qu.:16.20 3rd Qu.: 572.60 E: 70705 C3 : 50161
## Max. :28.99 Max. :1445.46 F: 23046 C4 : 48857
## G: 5489 (Other):570784
## emp_title emp_length home_ownership
## : 51451 10+ years:291569 ANY : 3
## Teacher : 13469 2 years : 78870 MORTGAGE:443555
## Manager : 11240 < 1 year : 70601 NONE : 46
## Registered Nurse: 5525 3 years : 70026 OTHER : 182
## Owner : 5376 1 year : 57095 OWN : 87470
## RN : 5355 5 years : 55704 RENT :356117
## (Other) :794957 (Other) :263508
## annual_inc verification_status issue_d
## Min. : 1200 Not Verified :266744 Oct-2015: 48631
## 1st Qu.: 45000 Source Verified:329558 Jul-2015: 45962
## Median : 65000 Verified :291071 Dec-2015: 44341
## Mean : 75028 Oct-2014: 38782
## 3rd Qu.: 90000 Nov-2015: 37529
## Max. :9500000 Aug-2015: 35886
## (Other) :636242
## loan_status pymnt_plan purpose
## Current :601777 n:887363 debt_consolidation:524214
## Fully Paid :207723 y: 10 credit_card :206181
## Charged Off : 45248 home_improvement : 51829
## Late (31-120 days): 11591 other : 42890
## Issued : 8460 major_purchase : 17277
## In Grace Period : 6253 small_business : 10377
## (Other) : 6321 (Other) : 34605
## title zip_code addr_state
## Debt consolidation :414000 945xx : 9770 CA :129517
## Credit card refinancing:164330 750xx : 9417 NY : 74082
## Home improvement : 40112 112xx : 9272 TX : 71136
## Other : 31892 606xx : 8641 FL : 60935
## Debt Consolidation : 15760 300xx : 8126 IL : 35476
## Major purchase : 12051 100xx : 7605 NJ : 33256
## (Other) :209228 (Other):834542 (Other):482971
## dti delinq_2yrs earliest_cr_line inq_last_6mths
## Min. : 0.00 Min. : 0.0000 Aug-2001: 6659 Min. : 0.0000
## 1st Qu.: 11.91 1st Qu.: 0.0000 Aug-2000: 6529 1st Qu.: 0.0000
## Median : 17.65 Median : 0.0000 Oct-2000: 6322 Median : 0.0000
## Mean : 18.13 Mean : 0.3144 Oct-2001: 6154 Mean : 0.6946
## 3rd Qu.: 23.95 3rd Qu.: 0.0000 Aug-2002: 6086 3rd Qu.: 1.0000
## Max. :1092.52 Max. :39.0000 Sep-2000: 5918 Max. :33.0000
## NA's :25 (Other) :849705 NA's :25
## open_acc pub_rec revol_bal revol_util
## Min. : 0.00 Min. : 0.0000 Min. : 0 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.: 0.0000 1st Qu.: 6443 1st Qu.: 37.70
## Median :11.00 Median : 0.0000 Median : 11875 Median : 56.00
## Mean :11.55 Mean : 0.1953 Mean : 16921 Mean : 55.07
## 3rd Qu.:14.00 3rd Qu.: 0.0000 3rd Qu.: 20829 3rd Qu.: 73.60
## Max. :90.00 Max. :86.0000 Max. :2904836 Max. :892.30
## NA's :25 NA's :25 NA's :498
## total_acc initial_list_status out_prncp out_prncp_inv
## Min. : 1.00 f:456843 Min. : 0 Min. : 0
## 1st Qu.: 17.00 w:430530 1st Qu.: 0 1st Qu.: 0
## Median : 24.00 Median : 6459 Median : 6456
## Mean : 25.27 Mean : 8403 Mean : 8400
## 3rd Qu.: 32.00 3rd Qu.:13659 3rd Qu.:13654
## Max. :169.00 Max. :49373 Max. :49373
## NA's :25
## total_pymnt total_pymnt_inv total_rec_prncp total_rec_int
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 1915 1st Qu.: 1900 1st Qu.: 1201 1st Qu.: 441.5
## Median : 4895 Median : 4862 Median : 3215 Median : 1073.3
## Mean : 7559 Mean : 7521 Mean : 5758 Mean : 1754.8
## 3rd Qu.:10617 3rd Qu.:10566 3rd Qu.: 8000 3rd Qu.: 2238.3
## Max. :57778 Max. :57778 Max. :35000 Max. :24205.6
##
## total_rec_late_fee recoveries collection_recovery_fee
## Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 0.3967 Mean : 45.92 Mean : 4.881
## 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :358.6800 Max. :33520.27 Max. :7002.190
##
## last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d
## Jan-2016:470148 Min. : 0.0 Feb-2016:553404 Jan-2016:730572
## Dec-2015:150861 1st Qu.: 280.2 :252971 Dec-2015: 19308
## : 17659 Median : 462.8 Jan-2016: 78195 Nov-2015: 11490
## Oct-2015: 16000 Mean : 2164.2 Mar-2011: 107 Oct-2015: 10419
## Jul-2015: 14483 3rd Qu.: 831.2 Apr-2011: 101 Sep-2015: 10087
## Nov-2015: 13981 Max. :36475.6 Feb-2011: 91 Jul-2015: 8642
## (Other) :204241 (Other) : 2504 (Other) : 96855
## collections_12_mths_ex_med application_type
## Min. : 0.00000 INDIVIDUAL:886864
## 1st Qu.: 0.00000 JOINT : 509
## Median : 0.00000
## Mean : 0.01438
## 3rd Qu.: 0.00000
## Max. :20.00000
## NA's :141
## verification_status_joint acc_now_delinq tot_coll_amt
## :886864 Min. : 0.000000 Min. : 0
## Not Verified : 281 1st Qu.: 0.000000 1st Qu.: 0
## Source Verified: 61 Median : 0.000000 Median : 0
## Verified : 167 Mean : 0.004991 Mean : 226
## 3rd Qu.: 0.000000 3rd Qu.: 0
## Max. :14.000000 Max. :9152545
## NA's :25 NA's :70272
## tot_cur_bal total_rev_hi_lim
## Min. : 0 Min. : 0
## 1st Qu.: 29853 1st Qu.: 13900
## Median : 80559 Median : 23700
## Mean : 139458 Mean : 32069
## 3rd Qu.: 208205 3rd Qu.: 39800
## Max. :8000078 Max. :9999999
## NA's :70272 NA's :70272